Introduction

About our data:

Our YouTube data set has 161,470 records and 17 variables. The variables in this data set are: “video_id”, “trending_date”, “title”, “channel_title”, “category_id”, “publish_date”, “time_frame”, “published_day_of_week”, “publish_country”, “tags”, “views”, “likes”, “dislikes”, “comment_count”, “comments_disabled”, “ratings_disabled”, and “video_error_or_removed”. The variables have several data types that we use in our analysis, including character, integer, logical (boolean), and factor. Each row in the data set is a particular trending video on a specific trending date. Four countries are analyzed: the United States, Canada, Great Britain, and France. Each country has its own list of trending videos on each day. The trending videos were collected from November 2017 to June 2018. According to Google (the owner of YouTube), the trending list is updated approximately every 15 minutes (citation needed), so the set of videos that trend over the course of a day fluctuates. At any given time, the trending list holds around 200 videos in each country.

Questions

Our questions fall into five main areas:

Data Cleaning

For the most part, our data set is quite user-friendly. When the data is loaded, R automatically assigns a data type to each variable. However, some of the automatically assigned types were not helpful for later analysis, so we changed them.
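As an illustration of this kind of type conversion, here is a minimal base R sketch. The toy values, the `%Y-%m-%d` date format, and the `"True"`/`"False"` string encoding are assumptions made for the example; only the column names come from our data set, and the conversions we actually applied may differ.

```r
# Toy stand-in for the raw data; all values are invented for illustration.
raw <- data.frame(
  video_id          = c("a1", "b2"),
  trending_date     = c("2018-04-01", "2018-04-02"),
  publish_country   = c("US", "FRANCE"),
  comments_disabled = c("False", "True"),
  views             = c("1500", "320"),
  stringsAsFactors  = FALSE
)

clean <- raw

# Character -> Date (the date format is an assumption).
clean$trending_date <- as.Date(clean$trending_date, format = "%Y-%m-%d")

# Character -> factor, so countries can be used as groups.
clean$publish_country <- factor(clean$publish_country)

# Character flag -> logical (the "True"/"False" encoding is an assumption).
clean$comments_disabled <- clean$comments_disabled == "True"

# Character -> integer counts.
clean$views <- as.integer(clean$views)

# Inspect the resulting types.
str(clean)
```

Running `str()` before and after the conversion is a quick way to confirm that each column ended up with the intended type.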

We also had to clean the category_id variable. The raw data identifies each category by a number rather than a name. Using a published list of YouTube API category IDs (https://gist.github.com/dgp/1b24bf2961521bd75d6c), we looked up the category each number corresponds to and recoded the variable as a factor labeled with the category names.
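The relabeling step can be sketched as follows. The lookup vector is a hypothetical, partial mapping with only three entries, and the sample IDs are invented; each ID-to-name pair should be checked against the category list cited above before use.

```r
# Hypothetical partial lookup from category ID to category name.
# Verify every entry against the cited category list.
category_lookup <- c(
  "10" = "Music",
  "24" = "Entertainment",
  "25" = "News & Politics"
)

# Invented sample of raw category IDs.
category_id <- c(10L, 24L, 25L, 10L)

# Recode the numeric IDs as a factor labeled with category names.
category <- factor(
  category_id,
  levels = as.integer(names(category_lookup)),
  labels = unname(category_lookup)
)

# Count videos per category name.
table(category)
```

Any ID that is missing from the lookup becomes `NA` in the resulting factor, which makes unmapped categories easy to spot.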

Categories:

Is there a discernible trend in the usage of question marks?

Regional Differences: Exploring Across the Sea

How does publishing time vary throughout the week by country?

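What is the blank gap in the middle? The daily counts below contain no rows between 2018-04-07 and 2018-04-14, which suggests that the data set is missing April 8 through April 13.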

## # A tibble: 10 x 2
##    trending_date nbr_trending_videos
##    <date>                      <int>
##  1 2018-04-01                    780
##  2 2018-04-02                    780
##  3 2018-04-03                    789
##  4 2018-04-04                    786
##  5 2018-04-05                    792
##  6 2018-04-06                    798
##  7 2018-04-07                    794
##  8 2018-04-14                    790
##  9 2018-04-15                    785
## 10 2018-04-16                    781
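A table of daily counts like the one above, together with a check for dates that have no rows at all, could be produced along these lines. This is a base R sketch on invented sample rows; the names `videos`, `daily`, `all_days`, and `missing_days` are hypothetical, and the original table appears to have been built with the tidyverse rather than with this code.

```r
# Invented sample rows; the real data has one row per trending video per day.
videos <- data.frame(
  video_id = c("a", "b", "c", "d", "e"),
  trending_date = as.Date(c("2018-04-06", "2018-04-06", "2018-04-07",
                            "2018-04-14", "2018-04-14"))
)

# Count the rows for each trending date.
daily <- aggregate(video_id ~ trending_date, data = videos, FUN = length)
names(daily)[2] <- "nbr_trending_videos"

# Every calendar day between the first and last trending date.
all_days <- seq(min(daily$trending_date), max(daily$trending_date), by = "day")

# Days in that range with no rows in the data.
missing_days <- all_days[!all_days %in% daily$trending_date]

print(daily)
print(missing_days)
```

Listing the missing days explicitly helps separate a true gap in data collection from a plotting artifact.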